27 research outputs found

    Biomedical ontology MeSH improves document clustering qualify on MEDLINE articles: A comparison study

    19th IEEE International Symposium on Computer-Based Medical Systems, CBMS 2006, Salt Lake City, UT.
    Document clustering has been used for better document retrieval, document browsing, and text mining. In this paper, we investigate whether the biomedical ontology MeSH improves clustering quality for MEDLINE articles. For this investigation, we perform a comprehensive comparison study of various document clustering approaches, including hierarchical clustering methods (single-link, complete-link, and average-link), Bisecting K-means, K-means, and Suffix Tree Clustering (STC), in terms of efficiency, effectiveness, and scalability. According to our experimental results, the biomedical ontology MeSH significantly enhances clustering quality on biomedical documents. In addition, our results show that the better-performing document clustering approaches, such as Bisecting K-means, K-means, and STC, gain some benefit from the MeSH ontology, while the hierarchical algorithms, which show the poorest clustering quality, do not benefit from it.

    A comprehensive comparison study of document clustering for a biomedical digital library MEDLINE

    Presented at the 2006 ACM/IEEE Joint Conference on Digital Library (JCDL 2006), June 11-15, 2006, Chapel Hill, NC, USA. Retrieved 6/26/2006 from http://www.ischool.drexel.edu/faculty/thu/My%20Publication/Conference-papers/JCDL06.pdf.
    Document clustering has been used for better document retrieval, document browsing, and text mining in digital libraries. In this paper, we perform a comprehensive comparison study of various document clustering approaches, including three hierarchical methods (single-link, complete-link, and average-link), Bisecting K-means, K-means, and Suffix Tree Clustering, in terms of efficiency, effectiveness, and scalability. In addition, we apply a domain ontology to document clustering to investigate whether an ontology such as MeSH improves clustering quality for MEDLINE articles. Because an ontology is a formal, explicit specification of a shared conceptualization for a domain of interest, the use of ontologies is a natural way to solve traditional information retrieval problems such as synonym/hypernym/hyponym problems. We conducted extensive experiments based on several evaluation metrics (misclassification index, F-measure, cluster purity, and entropy) on very large article sets from MEDLINE, the largest biomedical digital library.
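
Two of the evaluation metrics named in this abstract, cluster purity and entropy, can be illustrated with a minimal sketch. The definitions below are the common textbook ones and may differ in normalization from those used in the paper; the document IDs, class labels, and function names are hypothetical.

```python
import math
from collections import Counter


def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = sum(len(c) for c in clusters)
    majority = sum(
        Counter(labels[d] for d in c).most_common(1)[0][1]
        for c in clusters
        if c
    )
    return majority / total


def entropy(clusters, labels):
    """Size-weighted mean of per-cluster class entropy (base 2); lower is better."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        if not c:
            continue
        counts = Counter(labels[d] for d in c)
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in counts.values())
        result += (len(c) / total) * h
    return result


# Toy data (hypothetical): four documents, two gold classes, two clusters.
labels = {"d1": "cancer", "d2": "cancer", "d3": "diabetes", "d4": "diabetes"}
clusters = [["d1", "d2", "d3"], ["d4"]]

print(purity(clusters, labels))   # 3 of 4 documents sit in their cluster's majority class
print(entropy(clusters, labels))
```

Purity rewards clusters dominated by one class, while entropy penalizes mixed clusters in proportion to their size, so the two are usually reported together.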

    A coherent biomedical literature clustering and summarization approach through ontology-enriched graphical representations

    Data Warehousing and Knowledge Discovery, Proceedings 4081, pp. 374-383, DOI: http://dx.doi.org/10.1007/11823728
    In this paper, we introduce a coherent biomedical literature clustering and summarization approach that employs a graphical representation method for text using a biomedical ontology. The key to the approach is to construct document cluster models as semantic chunks that capture the core semantic relationships in the ontology-enriched, scale-free graphical representation of documents. These document cluster models are used for both document clustering and text summarization by constructing a Text Semantic Interaction Network (TSIN). Our extensive experimental results indicate that our approach shows a 45% cluster quality improvement and a 72% clustering reliability improvement, in terms of misclassification index, over Bisecting K-means, a leading document clustering approach. In addition, our approach provides a concise but rich text summary in key concepts and sentences. The primary contribution of this paper is a coherent biomedical literature clustering and summarization approach that takes advantage of ontology-enriched graphical representations. Our approach significantly improves the quality of document clusters and the understandability of documents through summaries.

    A semantic approach for mining hidden links from complementary and non-interactive biomedical literature

    Presented at the 2006 SIAM Conference on Data Mining (SIAM DM 2006). Retrieved 6/26/2006 from http://www.ischool.drexel.edu/faculty/thu/My%20Publication/Conference-papers/SIAM06-Hu.pdf.
    Two complementary and non-interactive sets of articles, when considered together, can reveal useful information of scientific interest that is not apparent in either set alone. Swanson called such hidden links undiscovered public knowledge (UPK). The novel connection between Raynaud disease and fish oils was uncovered from complementary and non-interactive biomedical literature by Swanson in 1986. Since then, there have been many approaches to uncovering UPK by mining the biomedical literature. These earlier works, however, required substantial manual intervention to reduce the number of possible connections. This paper proposes a semantic-based mining model for undiscovered public knowledge using the biomedical literature. Our method replaces manual ad-hoc pruning with semantic knowledge from biomedical ontologies. Using the semantic types and semantic relationships of biomedical concepts, our prototype system can identify the relevant concepts collected from MEDLINE and generate novel hypotheses between these concepts. The system successfully replicates Swanson's two famous discoveries: Raynaud disease/fish oils and migraine/magnesium. Compared with previous approaches such as LSI-based and traditional association rule-based methods, our method generates far fewer but more relevant novel hypotheses, and requires much less human intervention in the discovery procedure.

    Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule

    Paper accepted for publication in Journal of Information Systems. Retrieved 6/26/2006 from http://www.ischool.drexel.edu/faculty/thu/My%20Publication/Journal-papers/JIS_hu2006.pdf.
    The novel connection between Raynaud disease and fish oils was uncovered from two disjoint biomedical literature sets by Swanson in 1986. Since then, there have been many approaches to uncovering novel connections by mining the biomedical literature. One popular approach is to adapt the Association Rule (AR) method to automatically identify implicit novel connections between concept A and concept C from two disjoint sets of documents through an intermediate concept B. Since the A and C concepts do not occur together in the same data set, the mining goal is to find novel connections between A and C concepts across the disjoint data sets. The approach first applies association rule mining to the two disjoint biomedical literature sets separately to generate two rule sets (A→B, B→C), and then applies the transitive law to obtain the novel connections A→C. However, this approach generates a huge number of possible connections among the millions of biomedical concepts, and many of these hypothetical connections are spurious, useless, and/or biologically meaningless. Thus it is essential to develop a new approach that generates highly likely novel and biologically relevant connections among biomedical concepts. This paper presents a Biomedical Semantic-based Association Rule System (Bio-SARS) that significantly reduces spurious, useless, or biologically irrelevant connections through semantic filtering. Compared to other approaches such as LSI and the traditional association rule-based approach, our approach generates far fewer rules, and many of these rules represent relevant connections among biological concepts.
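
The transitive step described in this abstract (combine A→B rules from one literature set with B→C rules from the other to propose A→C links) can be sketched as follows. This is only an illustration of the plain transitive join, not the Bio-SARS system: it does no semantic filtering, and the function name, the `known_pairs` parameter, and the toy rules are assumptions made for the example.

```python
def infer_hidden_links(rules_ab, rules_bc, known_pairs=frozenset()):
    """Join A->B and B->C rules on B and return {(A, C): [B, ...]}.

    rules_ab, rules_bc: iterables of (antecedent, consequent) pairs.
    known_pairs: (A, C) pairs that already co-occur and are therefore not novel.
    """
    # Index the second rule set by its antecedent (the intermediate concept B).
    consequents_by_b = {}
    for b, c in rules_bc:
        consequents_by_b.setdefault(b, set()).add(c)

    links = {}
    for a, b in rules_ab:
        for c in consequents_by_b.get(b, ()):
            if a == c or (a, c) in known_pairs:
                continue  # skip trivial or already-known connections
            links.setdefault((a, c), set()).add(b)
    return {pair: sorted(bs) for pair, bs in links.items()}


# Toy rules (illustrative only), one list per disjoint literature set.
rules_ab = [
    ("Raynaud disease", "blood viscosity"),
    ("Raynaud disease", "platelet aggregation"),
]
rules_bc = [
    ("blood viscosity", "fish oils"),
    ("platelet aggregation", "fish oils"),
]

hidden = infer_hidden_links(rules_ab, rules_bc)
print(hidden)
```

On real data this join produces the very large candidate set the abstract warns about, which is why a pruning stage such as semantic filtering would follow it.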

    Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS

    Background: Finding relevant articles in PubMed is challenging because it is hard to express the user's specific intention in the given query interface, and a keyword query typically retrieves a large number of results. Researchers have applied machine learning techniques to find relevant articles by ranking the articles according to a learned relevance function. However, the learning and ranking process is usually done offline without being integrated with the keyword queries, and users have to provide a large number of training documents to reach reasonable learning accuracy. This paper proposes a novel multi-level relevance feedback system for PubMed, called RefMed, which supports both ad-hoc keyword queries and multi-level relevance feedback in real time on PubMed. Results: RefMed supports multi-level relevance feedback by using RankSVM as the learning method, and thus it achieves higher accuracy with less feedback. RefMed "tightly" integrates RankSVM into an RDBMS to support both keyword queries and multi-level relevance feedback in real time; the tight coupling of RankSVM and the DBMS substantially improves the processing time. An efficient parameter selection method for RankSVM is also proposed, which tunes the RankSVM parameter without performing validation. RefMed thereby achieves high learning accuracy in real time without a validation process. RefMed is accessible at http://dm.postech.ac.kr/refmed. Conclusions: RefMed is the first multi-level relevance feedback system for PubMed, and it achieves high accuracy with less feedback. It effectively learns an accurate relevance function from the user's feedback and efficiently processes the function to return relevant articles in real time.
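
To make the idea of learning from multi-level feedback more concrete, the sketch below shows the standard pairwise reduction that ranking SVMs are generally built on: graded judgments are turned into preference pairs, each represented by a feature-difference vector. The abstract does not describe RefMed's internals at this level, so this is a generic illustration; the function name, feature vectors, and relevance levels are hypothetical.

```python
from itertools import combinations


def preference_pairs(feedback):
    """Turn graded feedback into pairwise training examples.

    feedback: list of (feature_vector, relevance_level) tuples.
    Returns a list of (difference_vector, label) where label is +1 if the
    first document of the pair was judged more relevant, else -1.
    Pairs with equal relevance levels carry no preference and are skipped.
    """
    pairs = []
    for (xi, ri), (xj, rj) in combinations(feedback, 2):
        if ri == rj:
            continue
        diff = [a - b for a, b in zip(xi, xj)]
        pairs.append((diff, 1 if ri > rj else -1))
    return pairs


# Hypothetical feedback: 2 = highly relevant, 1 = partially relevant, 0 = not relevant.
feedback = [
    ([1.0, 0.0], 2),
    ([0.5, 0.5], 1),
    ([0.0, 1.0], 0),
    ([0.2, 0.9], 0),
]

pairs = preference_pairs(feedback)
print(len(pairs))
```

Four judged documents on three levels yield five preference pairs here, whereas binary feedback on the same documents would yield fewer distinct constraints, which is one intuition for why multi-level feedback can reach a given accuracy with fewer judgments.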

    ACKNOWLEDGEMENTS

    I am indebted to many people for their support and advice to the successful completion of my Ph.D degree and this dissertation. My deepest gratitude goes to my supervisor, Dr. Xiaohua Hu, for his guidance and assistance with this dissertation as well as all the research during my doctoral research endeavor for the past four years. He has helped me to move forward with investigation in-depth and to remain focused on achieving my goal. I am grateful to my committee members, Dr. Il-Yeol Song, Dr. Xia Lin, Dr. Bahrad A. Sokhansanj, and Dr. Don Goelman, for their invaluable advice and suggestions. Especially, Dr. Song has always been meticulous in proofreading my research papers. His advice on both academic and non-academic matters has been inestimable. I would like to express my appreciation to my parents, SungTae Yoo and SunJa Park, and to my parents-in-law, TaeWhan Jung and SoonAe Goo, for their love, support and encouragement. I would like to express my sincere thanks to my wife YoungJae Jung for her love and sacrifice. Without her constant sacrifice, this thesis would not have bee